Search Results: "tytso"

14 March 2009

Matthew Garrett: ext4, application expectations and power management

Edited to add: Sorry, it turns out that ext4 does now have a heuristic to deal with this case. However, it seems that btrfs doesn't. My point remains, though - designing a filesystem in such a way that a useful behaviour is impossible is unhelpful for the vast majority of application writers, even if it does improve performance in some other use cases.

Further edit: It's mentioned in the comments that btrfs will force a flush of data when a rename is performed as of 2.6.30. That prevents the data loss, but it means that we're still stuck with disk access when we'd be happy with that being batched up for the future. It feels like we're optimising for the wrong case here.

Original text:

There's been a certain amount of discussion about behavioural differences between ext3 and ext4[1], most notably due to ext4's increased window of opportunity for files to end up empty due to both a longer commit window and delayed allocation of blocks in order to obtain a more pleasing on-disk layout. The applications that failed hardest were doing open("foo", O_TRUNC), write(), close() and then being surprised when they got zero length files back after a crash. That's fine. That was always stupid. Asking the filesystem to truncate a file and then writing to it is an invitation to failure - there's clearly no way for it to intuit the correct answer here. In the end this has been avoided by avoiding delayed allocation when writing to a file that's just been truncated, so everything's fine.

However, there's another case that also breaks. A common way of saving files is to open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). The mindset here is that a crash will either result in foo.tmp being zero length, foo still being the original file or foo being your new data. The important aspect of this is that the desired behaviour of this code is that foo will contain either the original data or the new data. You may suffer data loss, but you won't suffer complete data loss - the application state will be consistent.

When used with its (default) data=ordered journal option, ext3 provided these semantics. ext4 doesn't. Instead, if you want to ensure that your data doesn't get trampled, it's necessary to fsync() before closing in order to make sure it hits disk. Otherwise the rename can occur before the data is written, and you're back to a zero length file. ext4 doesn't make guarantees about whether data will be flushed before metadata is written.

Now, POSIX says this is fine, so any application that expected this behaviour is already broken by definition. But this is rules lawyering. POSIX says that many things that are not useful are fine, but doesn't exist for the pleasure of sadistic OS implementors. POSIX exists to allow application writers to write useful applications. If you interpret POSIX in such a way that gains you some benefit but shafts a large number of application writers then people are going to be reluctant to use your code. You're no longer a general purpose filesystem - you're a filesystem that's only suitable for people who write code with the expectation that their OS developers are actively trying to fuck them over. I'm sure Oracle deals with this case fine, but I also suspect that most people who work on writing Oracle on a daily basis have very, very unfulfilling lives.

But anyway. We can go and fix every single piece of software that saves files to make sure that it fsync()s, and we can avoid this problem. We can probably even do it fairly quickly, thanks to us having the source code to all of it. A lot of this code lives in libraries and can be fixed up without needing to touch every application. It's not the end of the world.

So why do I still think it's a bad idea?

It's simple. open(), write(), close(), rename() and open(), write(), fsync(), close(), rename() are not semantically equivalent. One is "give me either the original data or the new data"[2]. The other is "always give me the new data". This is an important distinction. fsync() means that we've sent the data to the disk[3]. And, in general, that means that we've had to spin the disk up.
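Concretely, the two sequences look something like this (a minimal sketch in C, error handling mostly elided; the filenames are the foo/foo.tmp example from above):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only: replace "foo" atomically with new contents. */
static int save_file(const char *data, size_t len, int force_to_disk)
{
    int fd = open("foo.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    write(fd, data, len);
    if (force_to_disk)
        fsync(fd);  /* "always give me the new data": the disk has to spin up now */
    close(fd);
    /* Without the fsync(), "old data or new data" is what the application
     * wants, but the filesystem may commit the rename before the data,
     * leaving a zero-length foo after a crash. */
    return rename("foo.tmp", "foo");
}

int main(void)
{
    return save_file("new contents\n", 13, 1) ? 1 : 0;
}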

So, on the one hand, we're trying to use things like relatime to batch data to reduce the amount of time a disk has to be spun up. And on the other hand, we're moving to filesystems that require us to generate more I/O in order to guarantee that our data hits disk, which is a guarantee we often don't want anyway! Users will be fine with losing their most recent changes to preferences if a machine crashes. They will not be fine with losing the entirety of their preferences. Arguing that applications need to use fsync() and are otherwise broken is ignoring the important difference between these use cases. It's no longer going to be possible to spin down a disk when any software is running at all, since otherwise it's probably going to write something and then have to fsync it out of sheer paranoia that something bad will happen. And then probably fsync the directory as well, because what if someone writes an even more pathological filesystem. And the disks sit there spinning gently and chitter away as they write tiny files[4] and never spin down and the polar bears all drown in the bitter tears of application developers who are forced to drink so much to forget that they all die of acute liver failure by the age of 35 and where are we then oh yes we're screwed.

So. I said we could fix up applications fairly easily. But to do that, we need an interface that lets us do the right thing. The behaviour application writers want is one which ext4 doesn't appear to provide. Can that be fixed, please?

[1] xfs behaves like ext4 in this respect, so the obvious argument is that all our applications have been broken for years and so why are you complaining now. To which the obvious response is "Approximately anyone who ever used xfs expected their data to vanish if their machine crashed so nobody used it by default and seriously who gives a shit". xfs is a wonderful filesystem for all sorts of things, but it's lousy for desktop use for precisely this reason.

[2] Yes, ok, we've just established that it actually isn't that in the same way that GMT isn't UTC and battery refers to a collection of individual cells and so you don't usually put multiple batteries in your bike lights, but the point is that this is, for all practical intents and purposes, an unimportant distinction and not one people should have to care about in their daily lives.

[3] The disk is free to sit there bored for arbitrary periods of time before it does anything, but that's fine, because the OS is behaving correctly. Sigh.

[4] Dear filesystem writers - application developers like writing lots of tiny files, because it makes a large number of things significantly easier. This is fine because sheer filesystem performance is not high on the list of priorities of a typical application developer. The answer is not "Oh, you should all use sqlite". If the only effective way to use your filesystem is to use a database instead, then that indicates that you have not written a filesystem that is useful to typical application developers who enjoy storing things in files rather than binary blobs that end up with an entirely different set of pathological behaviours. If I wanted all my data to be in oracle then I wouldn't need a fucking filesystem in the first place, would I?

12 March 2009

Theodore Ts'o: Delayed allocation and the zero-length file problem

A recent Ubuntu bug has gotten slashdotted, and has started raising a lot of questions about the safety of using ext4. I've actually been meaning to blog about this for a week or so, but between a bout of the stomach flu and a huge todo list at work, I simply haven't had the time.

The essential problem is that ext4 implements something called delayed allocation. Delayed allocation isn't new to Linux; xfs has had delayed allocation for years. Pretty much all modern file systems have delayed allocation; according to the Wikipedia Allocate-on-flush article, this includes HFS+, Reiser4, and ZFS, and btrfs has this property as well. Delayed allocation is a major win for performance, both because it allows writes to be streamed more efficiently to disk, and because it can reduce file fragmentation so that files can later be read more efficiently from disk. This sounds like a good thing, right? It is, except for badly written applications that don't use fsync() or fdatasync().

Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds and uses a journalling mode called data=ordered. What does this mean? The journalling mode data=ordered means that before the commit takes place, any data blocks associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if this is not done (which would be the case if the disk is mounted with the mount option data=writeback), then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing uninitialized data that had previously belonged to another user (say, their e-mail or their p0rn), which would be a Bad Thing from a security perspective. However, this had the side effect of essentially guaranteeing that anything that had been written was on disk after 5 seconds. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we'll ignore that for the purposes of this discussion.)

Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data, even though POSIX never really made any such guarantee. This became especially noticeable on Ubuntu, which uses many proprietary, binary-only drivers, which caused some Ubuntu systems to become highly unreliable, especially for Alpha releases of Ubuntu Jaunty, with the net result that some Ubuntu users have become used to their machines regularly crashing. (I use bleeding edge kernels, and I don't see the kind of unreliability that apparently at least some Ubuntu users are seeing, which came as quite a surprise to me.)

So what are the solutions to this? One is that the applications could simply be rewritten to properly use fsync() and fdatasync(). This is what is required by POSIX, if you want to be sure that data has gotten written to stable storage. Some folks have resisted this suggestion on two grounds: first, that it's too hard to fix all of the applications out there, and second, that fsync() is too slow. This perception that fsync() is too slow was most recently caused by a problem with Firefox 3.0. As Mike Shaver put it:
On some rather common Linux configurations, especially using the ext3 filesystem in the data=ordered mode, calling fsync doesn't just flush out the data for the file it's called on, but rather on all the buffered data for that filesystem.
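The calling pattern at issue looks roughly like this (a sketch with made-up filenames, not Firefox's actual code): a small database file is fsync()'d frequently while megabytes of unrelated dirty data are pending on the same filesystem.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch only: hypothetical filenames, error handling omitted. */
int main(void)
{
    /* Unrelated bulk writer: 2 MB of dirty, not-yet-allocated data. */
    int big = open("big-download.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char *buf = calloc(1, 2 * 1024 * 1024);
    write(big, buf, 2 * 1024 * 1024);

    /* The "Firefox" side: a small transaction that must be durable. */
    int db = open("bookmarks.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
    const char *txn = "one small transaction\n";
    write(db, txn, strlen(txn));

    /* On ext3 data=ordered, this fsync() drags the unrelated 2 MB of dirty
     * data out to disk as well.  Under delayed allocation (ext4, XFS, btrfs)
     * those blocks have not been allocated yet, so only the small file is
     * forced out. */
    fsync(db);

    close(db);
    close(big);
    free(buf);
    return 0;
}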
Fundamentally, the problem is caused by data=ordered mode. This problem can be avoided by mounting the filesystem using data=writeback, or by using a filesystem that supports delayed allocation, such as ext4. This is because if you have a small sqlite database which you are fsync()'ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won't be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven't been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.

Another solution is a set of patches to ext4 that has been queued for the 2.6.30 merge window. These three patches (with git ids bf1b69c0, f32b730a, and 8411e347) will cause any delayed allocation blocks to be allocated immediately when a file is replaced. This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file. This solves the most annoying set of problems, where an existing file gets rewritten and, thanks to the delayed allocation semantics, that existing file gets replaced with a zero-length file. However, it will not solve the problem for newly created files, of course, which would still have delayed allocation semantics.

Yet another solution would be to mount ext4 volumes with the nodelalloc mount option. This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system. For those users, it may be that nodelalloc is the right solution for now. Personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.

A final solution, which might not be that hard to implement, would be a new mount option, data=alloc-on-commit. This would work much like data=ordered, with the additional constraint that all blocks that had delayed allocation would be allocated and forced out to disk before a commit takes place. This would probably give slightly better performance compared to mounting with nodelalloc, but it shares many of the disadvantages of nodelalloc, including making fsync() potentially very slow, because it would force all dirty blocks out to disk once again.

What's the best path forward? For now, I would recommend that Ubuntu gamers whose systems crash all the time and who want to use ext4 use the nodelalloc mount option. I haven't quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won't have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync(). Modern filesystems are all going to be using delayed allocation because of its inherent performance benefits, and whether you think the future belongs to ZFS, or btrfs, or XFS, or ext4, all of these filesystems use delayed allocation.

What do you think? Do you think all of these filesystems have gotten things wrong, and delayed allocation is evil? Should I try to implement a data=alloc-on-commit mount option for ext4? Should we try to fix applications to properly use fsync() and fdatasync()?

Related posts (automatically generated):
  1. Don't fear the fsync! After reading the comments on my earlier post, Delayed allocation...
  2. SSD's, Journaling, and noatime/relatime On occasion, you will see the advice that the ext3...
  3. Ext4 is now the primary filesystem on my laptop Over the weekend, I converted my laptop to use the...

3 March 2009

Jon Dowland: sorting out backups

I'm busy re-working how I handle backups of my personal files. A very useful recent discussion on some available features and technologies is at Theodore Ts'o's blog: http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/. From that post I discovered a package 'archfs' which provides a FUSE-powered filesystem view of an rdiff-backup history. I found a stale ITP for this package and have adopted it, so keep your eyes out for that. Looking at how I am going to approach things, I think the Debian package of anacron might need some work. I'm planning on looking at the issues there a bit more closely and feeding back to the maintainer when I do.

Lucas Nussbaum: Creating a large file without zeroing it: update

Given the large number of comments I got (26!), I feel obliged to post a summary of what was said. First, the problem:
I want to create a large file (let's say 10 GB) to use as swap space. This file can't be a sparse file (a file with holes, see Wikipedia if you don't know about sparse files).
Since I'm going to mkswap it, I don't care about the data that is actually in that file after creating it. The stupid way (but the only solution on ext3) to create it is to fill it with zeroes, which is very inefficient. Theodore Ts'o provided more information in a comment, which I'm copying here:
Yes, it will work on ext4. A convenient utility which makes this easy to use can be found at http://sandeen.fedorapeople.org/utilities/fallocate.c. It was written by Eric Sandeen, a former XFS developer who now works for Red Hat, who has been a big help making sure ext4 will be ready for Fedora and Red Hat Enterprise Linux. (Well, I guess I shouldn't call him a former XFS developer, since he still contributes patches to XFS now and then, but he's spending rather more time on ext4 these days.) One warning about the program: it calls the fallocate system call directly, and it doesn't quite have the right architecture-specific magic for certain architectures which have various restrictions on how arguments need to be passed to system calls. In particular, IIRC, I believe there will be issues on the s390 and powerpc architectures. The real right answer is to get fallocate into glibc; folks with pull into making glibc do the right thing, please talk to me.

Glibc does have posix_fallocate(), which implements the POSIX interface. posix_fallocate() is wired to use the fallocate system call, for sufficiently modern versions of glibc. However, posix_fallocate() is problematic for some applications; the problem is that for filesystems that don't support fallocate(), posix_fallocate() will simulate it by writing all zeros to the file. However, this is not necessarily the right thing to do; there are some applications that want fallocate() for speed reasons, but if the filesystem doesn't support it, they want to receive the ENOSPC error message, so they can try some other fallback which might or might not involve writing all zeros to the file.

The other shortcoming with posix_fallocate() is that it doesn't support the FALLOC_FL_KEEP_SIZE flag. What this flag allows you to do is to allocate disk blocks to the file, but not to modify the i_size parameter. This allows you to allocate space for files such as log files and mail spool files so they will be contiguous on disk, but since i_size is not modified, programs that append to the file won't get confused, and tail -f will continue to work. For example, if you know that your log files are normally approximately 10 megs a day, you can fallocate 10 megabytes, and then the log file will be contiguous on disk, and the space is guaranteed to be there (since it is already allocated). When you compress the log file at the end of the day, if the log file ended up being slightly smaller than 10 megs, the extra blocks will be discarded when you compress the file, or if you like, you can explicitly trim away the excess using ftruncate().
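For reference, here is a minimal sketch in C of the two preallocation styles described above, using the glibc fallocate() wrapper (an assumption: as the comment notes, fallocate() was not in glibc at the time and Sandeen's utility invokes the system call directly; on current systems the wrapper is available). The filenames are made up:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>   /* FALLOC_FL_KEEP_SIZE */
#include <stdio.h>
#include <unistd.h>

/* Sketch only; compile with -D_FILE_OFFSET_BITS=64 on 32-bit systems so
 * off_t can hold the 10 GB length. */
int main(void)
{
    /* Swap-file style: allocate 10 GB up front, extending i_size. */
    int swapfd = open("swapfile", O_WRONLY | O_CREAT, 0600);
    if (swapfd < 0 || fallocate(swapfd, 0, 0, (off_t)10 * 1024 * 1024 * 1024) != 0)
        perror("fallocate(swapfile)");
    if (swapfd >= 0)
        close(swapfd);

    /* Log-file style: reserve ~10 MB of contiguous blocks but leave i_size
     * alone (FALLOC_FL_KEEP_SIZE), so appends and tail -f keep working. */
    int logfd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (logfd < 0 || fallocate(logfd, FALLOC_FL_KEEP_SIZE, 0, 10 * 1024 * 1024) != 0)
        perror("fallocate(app.log)");
    if (logfd >= 0)
        close(logfd);

    return 0;
}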
fallocate works fine: creating a 20 GB file is almost immediate. Syncing or unmounting the filesystem is also immediate, and reading the file returns only zeros. I'm not sure how it is implemented, but it looks nice :-). However, it still doesn't solve my initial problem: mkswap works, but not swapon:
:/tmp# touch tmp
:/tmp# /root/fallocate -l 10g tmp
:/tmp# ls -lh tmp
-rw-r--r-- 1 root root 10G Mar  3 11:01 tmp
:/tmp# du tmp
10485764	tmp
:/tmp# mkswap tmp
Setting up swapspace version 1, size = 10737414 kB
no label, UUID=a316ce8e-cf33-412b-8dc0-e10d9f2ebdbb
:/tmp# strace swapon tmp
[...]
swapon("/tmp/tmp")                      = -1 EINVAL (Invalid argument)
write(2, "swapon: tmp: Invalid argument\n", 30swapon: tmp: Invalid argument
) = 30
exit_group(-1)
(swapon works fine if the file is created normally without using fallocate()). Any other ideas?

2 March 2009

Theodore Ts'o: SSD's, Journaling, and noatime/relatime

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD's) due to the extra writes caused by journaling, and so Linux users using SSD's should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling actually is in practice.

For this experiment I used ext4, since I recently added a feature to track the amount of writes to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that (starting in 2.6.29) it can support operation with and without a journal, allowing me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one: a git clone of the kernel source tree, a kernel make, and then a make clean.

For the first test, I ran the workload using no special mount options, the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)
Amount of data written (in megabytes) on an ext4 filesystem
Operation     with journal   w/o journal   percent change
git clone     367.7          353.0          4.00%
make          231.1          203.4         12.0%
make clean     14.6            7.7         47.3%
What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount of data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal, and the journal transaction committed, before the metadata is written to its final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:
Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime
Operation     with journal   w/o journal   percent change
git clone     367.0          353.0          3.81%
make          207.6          199.4          3.95%
make clean      6.45           3.73        42.17%
This reduces the extra cost of the journal in the git clone and make steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last accessed time for kernel source files and directories.

The relatime mount option

There is a newer alternative to the noatime mount option: relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time. This allows programs to determine whether a file has been read since it was last modified. The usual (actually, only) example given of such an application is the mutt mail reader, which uses the last accessed time to determine whether new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard POSIX atime semantics):
Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime
Operation     with journal   w/o journal   percent change
git clone     366.6          353.0          3.71%
make          216.8          203.7          6.04%
make clean     13.34           6.97        45.75%
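For the curious, the relatime rule amounts to roughly the following check (a sketch in C against struct stat fields; the real test is made inside the kernel, and later kernels also refresh atime once it is more than a day old):

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Roughly the policy described above: under relatime a read updates atime
 * only when atime is not newer than mtime or ctime, so "has this file been
 * read since it was last modified?" still works.  Sketch only. */
static bool relatime_would_update_atime(const struct stat *st)
{
    return st->st_atime <= st->st_mtime || st->st_atime <= st->st_ctime;
}

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2 || stat(argv[1], &st) != 0) {
        fprintf(stderr, "usage: relatime-check <existing-file>\n");
        return 1;
    }
    printf("%s: a read %s update atime under relatime\n", argv[1],
           relatime_would_update_atime(&st) ? "would" : "would not");
    return 0;
}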
Personally, I don't think relatime is worth it. There are other ways of working around the issue with mutt: for example, you can use Maildir-style mailboxes, or you can use mutt's check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don't need atime updates, and then clear the flag for the Unix mbox files where you do care about atime updates. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will then inherit the noatime flag.

Comparing ext3 and ext2 filesystems
Amount of data written (in megabytes) on an ext3 and ext2 filesystem
Operation     ext3      ext2     percent change
git clone     374.6     357.2     4.64%
make          230.9     204.4    11.48%
make clean     14.56      6.54   55.08%
Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The difference between these results and the ones involving ext4 is the result of the fact that ext2 does not have the directory index feature (aka htree support), and both ext2 and ext3 do not have extents support, but rather use the less efficient indirect block scheme. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4. Still, the results are substantially similar to the first set of POSIX-compliant atime update numbers. (I didn't bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)

Conclusion

So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD's come from? Some of it may have been from people worrying too much about extreme workloads such as make clean; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn't that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first generation SSD's had a very bad problem with what has been called the write amplification effect, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 has more synchronous write operations (that is, where ext3 waits for the write operation to be complete before moving on), and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD's, such as Intel's X25-M SSD, have worked around the write amplification effect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, much of this overhead can be reduced by enabling the noatime option, with relatime providing some benefit; but ultimately, if the goal is to reduce your file system's write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.

Related posts (automatically generated):
  1. Should Filesystems Be Optimized for SSD's? In one of the comments to my last blog entry,...
  2. Aligning filesystems to an SSD's erase block size I recently purchased a new toy, an Intel X25-M SSD,...
  3. Fast ext4 fsck times This wasn't one of the things we were explicitly engineering...

26 February 2009

Theodore Ts'o: Fast ext4 fsck times, revisited

Last night I managed to finish up a rather satisfying improvement to ext4's inode and block allocators. Ext4's original allocator was actually a bit more simple-minded than ext3's, in that it didn't implement the Orlov algorithm to spread out top-level directories for better filesystem aging. It was also buggy in certain ways, where it would return ENOSPC even when there were still plenty of inodes in the file system. So I had been working on extending ext3's original Orlov allocator so it would work well with ext4.

While I was at it, it occurred to me that one of the tricks I could play with ext4's flex groups (which are higher-order collections of block groups) was to bias the block allocation algorithms such that the first block group in a flexgroup would be preferred for use by directories, and biased against data blocks for regular files. This meant that directory blocks would get clustered together, which cut a third off the time needed for e2fsck pass 2:
Comparison of e2fsck times on a 32GB partition
(real/user/system times in seconds; I/O as MB read and MB/s)

           ext4 old allocator                    ext4 new allocator
Pass     real   user   system  MB read  MB/s    real   user   system  MB read  MB/s
1        6.69   4.06   0.90     82     12.25    6.70   3.63   1.58     82     12.23
2       13.34   2.30   3.78    133      9.97    4.24   1.27   2.46    133     31.36
3        0.02   0.01   0         1     63.85    0.01   0.01   0.01      1     82.69
4        0.28   0.27   0         0      0       0.23   0.22   0         0      0
5        2.60   2.31   0.03      1      0.38    2.42   2.15   0.07      1      0.41
Total   23.06   9.03   4.74    216      9.37   13.78   7.33   4.19    216     15.68

As you may recall from my previous observations on this blog, although we hadn't been explicitly engineering for this, a file system consistency check on an ext4 file system tends to be a factor of 6-8 faster than the e2fsck times on an equivalent ext3 file system, mainly due to the elimination of indirect blocks and the uninit_bg feature reducing the amount of disk reads necessary in e2fsck's pass 1. However, the ext4 layout optimizations didn't do much for e2fsck's pass 2. Well, the optimization of the block and inode allocators is complementary to the original ext4 fsck improvements, since it focuses on what we hadn't optimized the first time around: e2fsck pass 2 times have been cut by a third, and the overall fsck time has been cut by 40%. Not too shabby!

Of course, we need to do more testing to make sure we haven't caused other file system benchmarks to degrade, although I'm cautiously optimistic that this will end up being a net win. I suspect that some benchmarks will go up by a little, and others will go down a little, depending on how heavily the benchmark exercises directory operations versus sequential I/O patterns. If people want to test this new allocator, it is in the ext4 patch queue. If all goes well, I will hopefully be pushing it to Linus after 2.6.29 is released, at the next merge window.

For comparison's sake, here is a comparison of the fsck time of the same collection of files and directories, comparing ext3 and the original ext4 block and inode allocator. The file system in question is a 32GB install of Ubuntu Jaunty, with a personal home directory, a rather large Maildir directory, some Linux kernel trees, and an e2fsprogs tree. It's basically the emergency environment I set up on my Netbook at FOSDEM. In all cases the file systems were freshly copied from the original root directory using the command rsync -axH / /mnt.

It's actually a bit surprising to me that ext3's pass 2 e2fsck time was that much better than the e2fsck time under the old ext4 allocator. My previous experience has shown that the two are normally about the same, with a throughput of around 9-10 MB/s for e2fsck's pass 2 for both ext3 file systems and ext4 file systems with the original inode/block allocators. Hence, I would have expected ext3's pass 2 time to have been 12-13 seconds, and not 6. I'm not sure how that happened, unless it was the luck of the draw in terms of how things ended up getting allocated on disk. So I'm not too sure what happened there, but overall things look quite good for ext4 and fsck times!
Comparison of e2fsck times on a 32GB partition
(real/user/system times in seconds; I/O as MB read and MB/s)

           ext3                                  ext4 old allocator
Pass     real    user   system  MB read  MB/s    real   user   system  MB read  MB/s
1       108.40  13.74   11.53   583      5.38    6.69   4.06   0.90     82     12.25
2         5.91   1.74    2.56   133     22.51   13.34   2.30   3.78    133      9.97
3         0.03   0.01    0        1     31.21    0.02   0.01   0         1     63.85
4         0.28   0.27    0        0      0       0.28   0.27   0         0      0
5         3.17   0.92    0.13     2      0.63    2.60   2.31   0.03      1      0.38
Total   118.15  16.75   14.25   718      6.08   23.06   9.03   4.74    216      9.37
Vital Statistics of the 32GB partition
312214 inodes used (14.89%)
263 non-contiguous files (0.1%)
198 non-contiguous directories (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 292698/40
4388697 blocks used (52.32%)
0 bad blocks
1 large file
263549 regular files
28022 directories
5 character device files
1 block device file
5 fifos
615 links
20618 symbolic links (19450 fast symbolic links)
5 sockets
312820 files
Related posts (automatically generated):

  1. Fast ext4 fsck times This wasn't one of the things we were explicitly engineering...
  2. Wanted: Incremental Backup Solutions that Use a Database Dear Lazyweb, I'm looking for recommendations for Open Source backup...
  3. Ext4 is now the primary filesystem on my laptop Over the weekend, I converted my laptop to use the...

25 February 2009

Theodore Ts'o: Binary-only device drivers for Linux and the supportability matrix of doom

I came across the following from the ext3-users mailing list. The poor user was stuck on a never-updated RHEL 3 production server and running into kernel panic problems. He was advised to try updating to the latest kernel rpm from Red Hat, but he didn't feel he could do that. In his words:
I'm caught between a rock and a hard place due to the EMC PowerPath binary only kernel crack. Which makes it painful to both me and my customers to regularly upgrade the kernel. Not to mention the EMC supportability matrix of doom.
That pretty much sums it all up right there. The good news is that I've been told that dm-multipath is almost at the point where it has enough functionality to replace PowerPath. Of course, that version isn't yet shipping in distributions, and I'm sure it needs more testing, but it'll be good when enterprise users who need this functionality can move to a 100% fully open source storage stack. About the only thing left to do is to work in a mention of the Frying Pan of Doom and the recipe for Quick After-Battle Triple Chocolate Cake into the mix. :-)

Related posts (automatically generated):
  1. Tip o' the hat, wag o' the finger Linux power savings for laptop users It's interesting to see how far, and yet how much...
  2. How active are your local Linux User's Groups? At the Linux Foundation, I recently had been brainstorming with...

23 February 2009

Theodore Ts'o: Reflections on a complaint from a frustrated git user

Last week, Scott James Remnant posted a series of "Git Sucks" entries on his blog, starting with this one here, with follow-up entries here and here. His problem? To quote Scott, "I want to put a branch I have somewhere so somebody else can get it. That's the whole point of distributed revision-control, collaboration." He thought this was a mind-numbingly trivial operation, and was frustrated when it wasn't a one-line command in git.

Part of the problem here is that for most git workflows, most people don't actually use git push. That's why it's not covered in the git tutorial (this was a point of frustration for Scott). In fact, in most large projects, the number of people who need to use the scm push command is a very small percentage of the developer population, just as very few developers have commit privileges and are allowed to use the svn commit command in a project using Subversion. When you have a centralized repository, only the privileged few will be given commit privileges, for obvious security and quality control reasons. Ah, but in a distributed SCM world, things are more democratic: anyone can have their own repository, and so everyone can type the commands git commit or bzr commit. While this is true, the number of people who need to be able to publish their own branch is small. After all, the overhead in setting up your own server just so people can pull changes from you is quite large; and if you are just getting started, and only need to submit one or two patches, or even a large series of patches, e-mail is a far more convenient route. This was especially true in the early days of git's development, before web sites such as git.or.cz, github, and gitorious made it much easier for people to publish their own git repository. Even for a large series of changes, tools such as git format-patch and git send-email are very convenient for sending a patch series, and on the receiving side, the maintainer can use git am to apply a patch series sent via e-mail.

It turns out that from a maintainer's point of view, reviewing patches via e-mail is often much more convenient. Especially for developers who are just starting out with submitting patches to a project, it's rare that a patch is of sufficiently high quality that it can be applied directly into the repository without needing fixups of one kind or another. The patch might not have the right coding style compared to the surrounding code, or it might be fundamentally buggy because the patch submitter didn't understand the code completely. Indeed, more often than not, when someone submits a patch to me, it is more useful for indicating the location of the bug than anything else, and I often have to completely rewrite the patch before it enters the e2fsprogs mainline repository. Given that, publishing a patch that will require modification in a public repository where it is ready to be pulled just doesn't make sense for many entry-level patch submitters. E-mail is in fact less work, and more appropriate for review purposes. It is only when a mid-level to senior developer is trusted to create high quality patches that do not need review that publishing their branch in a pull-ready form really makes sense. And that is fairly rare, which is why it is not covered in most entry-level git documentation and tutorials.
Unfortunately, many people expect to see the command scm push in a distributed SCM, and since git pull is a commonly used command for beginning git users, they expect that they should use git push as well, not realizing that in a distributed SCM, push and pull are not symmetric operations. Therefore, while most git users won't need to use git push, git tutorials and other web pages which are attempting to introduce git to new users probably do need to do a better job explaining why most beginning participants in a project probably don't need their own publicly accessible repository that other people can pull from, and to which they can push changes for publication.

There is one exception to this, of course, and that is a developer who wants to get started using git for a new project which he or she is starting and is the author/maintainer, or someone who is interested in converting their project to git. And this is where bzr has an advantage over git, in that bzr is primarily funded by Canonical, which has a strong interest in pushing an on-line web service, Launchpad. This makes it easier for bzr to have relatively simple recipes for sharing a bzr repository, since the user doesn't need to have access to a server with a public IP address, or need to set up a web or bzr server; they can simply take advantage of Launchpad.

Of course, there are web sites which make it easy for people to publish their git repositories; earlier, I had mentioned git.or.cz, github, and gitorious. Currently, the git documentation and tutorials don't mention them, since they aren't formally affiliated with the git project (although they are used by many git users and developers, and the maintainers of these sites have contributed a large amount of code and documentation to git). This should change, I think. Scott's frustrations, which kicked off his "git sucks" complaints, would have been solved if the Git tutorial recommended that the easiest way for someone to publicly publish their repository is via one of these public web sites (although people who want to set up their own server are certainly free to do so). Most of these public repositories probably won't have much reason to exist, but they don't do much harm, and who knows? While most of the repositories published at github and gitorious will be like the hundreds of thousands of abandoned projects on Sourceforge, one or two of the new projects which someone starts experimenting on at github or gitorious could turn out to be the next Ruby on Rails or Python or Linux. And hopefully, they will allow more developers to be able to experiment with publishing commits in their own repositories, and lessen the frustrations of people like Scott who thought they needed their own repositories; whether or not a public repository is the best way for them to do what they need to do, at least this way they won't get as frustrated about git. :-)

Related posts (automatically generated):
  1. Git and hg John Goerzen recently posted about Git, Mercurial and Bzr that...
  2. Batches of patched batches of patches I found the following from the Risks Digest, authored by...

22 February 2009

Theodore Ts'o: Should Filesystems Be Optimized for SSD's?

In one of the comments to my last blog entry, an anonymous commenter writes:
You seem to be taking a different perspective to linus on the adapting to the the disk technology front (Linus seems to against having to have the OS know about disk boundaries and having to do levelling itself)
That's an interesting question, and I figure it's worth its own top-level entry, as opposed to a reply in the comment stream. One of the interesting design questions in any OS or computer architecture is where the abstraction boundaries should be drawn, and to which side of an abstraction boundary various operations should be pushed. Linus's argument is that a flash controller can do a better job of wear leveling, including detecting how worn a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the cell was last programmed), and so it doesn't make sense to try to do wear leveling in a flash file system. Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification, can be done either on the SSD or in the file system: for example, by using a log-structured file system, or some other copy-on-write file system instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem. In some cases, it's necessary to let additional information leak across the abstraction; for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks no longer need to be used. If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place. In addition, if the abstraction has been around for a long time, changing it also has costs, which have to be taken into account. The 512 byte sector LBA abstraction has been around a long time, and therefore dislodging it is difficult and costly.

For example, the same argument which says that because the underlying hardware details are changing between different generations of SSD, all of these details should be hidden in hardware, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks. One of the arguments for OBDs was that the hard drive has the best knowledge of how and where to store a contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but instead tell the hard drive, "I have this object, which is 134 kilobytes long; please store it somewhere on the disk." At least in theory, the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media, taking into account how worn the flash is, and automatically move the object around in the case of an SSD; and in the case of the HDD, the drive could know about (real) cylinder and track boundaries, and store the object in the most efficient way possible, since the drive has intimate knowledge about the low-level details of how data is stored on the disk. This theory makes a huge amount of sense; but there's only one problem. Object Based Disks have been proposed in academia, and advanced R&D shops of companies like Seagate et al. have been proposing them for over a decade, with absolutely nothing to show for it. Why? There have been two reasons proposed. One is that OBD vendors were too greedy, and tried to charge too much money for OBDs. Another explanation is that the interface abstraction for OBDs was too different, and so there wasn't enough software or file systems or OSes that could take advantage of OBDs.

Both explanations undoubtedly contributed to the commercial failure of OBDs, but the question is which is the bigger reason. And the reason why it is particularly important here is that, at least as far as Intel's SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don't need to change (much) in order to take advantage of the Intel SSD and get at least decent performance. However, if the price delta is the stronger reason for OBDs' failure, then the X25-M may be in trouble. Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte. Dumb MLC SATA SSDs are available for roughly half the cost per gigabyte (64 GB for $164).

So what does the market look like 12-18 months from now? If dumb SSDs are still available at 50% of the cost of smart SSDs, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software. Sure, it's probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn't the absolutely best technical choice. On the other hand, if prices drop significantly, and/or dumb SSDs completely disappear from the market, then time spent now optimizing for dumb SSDs will be completely wasted. So for Linus to make the proclamation that it's completely stupid to optimize for dumb SSDs seems to be a bit premature. Market externalities (for example, does Intel have patents that will prevent competing smart SSDs from entering the market and thus forcing price drops?) could radically change the picture. It's not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don't know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSDs become available. Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file-system fragmentation and deeper extent allocation trees. The reason for this is that, at least today, once a block is used by the file system, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released when the file was deleted. However, if 20% of the SSD's blocks have never been used, the X25-M can use that 20% of the flash for better garbage collection and defragmentation algorithms. And if Intel never releases a firmware update to add ATA TRIM support, then I will be out $400 of my own pocket for an SSD that lacks this capability, and so adding a block allocator which works around limitations of the X25-M probably makes sense. If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization.

(Hard drive vendors have been historically S-L-O-W to finish standardizing new features and then letting such features enter the marketplace, so I'm not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM support for about six months. What's taking the ATA committees and SSD vendors so long? <grin>) On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4. Or maybe Sandisk will make an ATA TRIM capable SSD available soon, which is otherwise competitive with Intel, and I get a free sample, but it turns out another optimization on Sandisk SSDs will give me an extra 10% performance gain under some workloads. Is it worth it in that case? Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether or not such a tweak will be either unnecessary, or perhaps actively unhelpful, in the next generation. As long as SSD manufacturers force us to treat these devices as black boxes, there may be a certain amount of cargo cult science forced upon us file system designers; or, I guess I should say, in order to be more academically respectable, we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box. Heh.

Related posts (automatically generated):
  1. Aligning filesystems to an SSD s erase block size I recently purchased a new toy, an Intel X25-M SSD,...

Theodore Ts'o: Should Filesystems Be Optimized for SSD's?

In one of the comments to my last blog entry, an anonymous commenter writes:
You seem to be taking a different perspective to Linus on the adapting to disk technology front (Linus seems to be against having the OS know about disk boundaries and having to do levelling itself)
That's an interesting question, and I figure it's worth its own top-level entry, as opposed to a reply in the comment stream. One of the interesting design questions in any OS or computer architecture is where the abstraction boundaries should be drawn, and onto which side of an abstraction boundary various operations should be pushed. Linus's argument is that a flash controller can do a better job of wear leveling, including detecting how worn a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the cell was last programmed), and so it doesn't make sense to try to do wear leveling in a flash file system. Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification, can be done either on the SSD or in the file system: for example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem. In some cases, it's necessary to let additional information leak across the abstraction; for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks no longer need to be used. If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place. In addition, if the abstraction has been around for a long time, changing it also has costs, which have to be taken into account. The 512-byte sector LBA abstraction has been around a long time, and therefore dislodging it is difficult and costly.

For example, the same argument which says that because the underlying hardware details change between different generations of SSD's, all of these details should be hidden in hardware, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks. One of the arguments for OBD's was that the hard drive has the best knowledge of how and where to store a contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but should instead tell the hard drive, "I have this object, which is 134 kilobytes long; please store it somewhere on the disk." At least in theory the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media: in the case of an SSD, taking into account how worn the flash is and automatically moving the object around; and in the case of an HDD, the drive could know about (real) cylinder and track boundaries, and store the object in the most efficient way possible, since the drive has intimate knowledge of the low-level details of how data is stored on the disk. This theory makes a huge amount of sense; but there's only one problem. Object Based Disks have been proposed in academia, and the advanced R&D shops of companies like Seagate et al. have been pushing them for over a decade, with absolutely nothing to show for it. Why? Two reasons have been proposed. One is that OBD vendors were too greedy, and tried to charge too much money for OBD's. Another explanation is that the interface abstraction for OBD's was too different, and so there wasn't enough software, or enough file systems or OS's, that could take advantage of OBD's.
Both explanations undoubtedly contributed to the commercial failure of OBD's, but the question is which is the bigger reason. The reason this is particularly important here is that, at least as far as Intel's SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don't need to change (much) in order to take advantage of the Intel SSD and get at least decent performance. However, if the price delta was the stronger reason for OBD's failure, then the X25-M may be in trouble. Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte. Dumb MLC SATA SSD's are available for roughly half the cost per gigabyte (64 GB for $164). So what does the market look like 12-18 months from now? If dumb SSD's are still available at 50% of the cost of smart SSD's, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software. Sure, it's probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn't the absolute best technical choice. On the other hand, if prices drop significantly, and/or dumb SSD's completely disappear from the market, then time spent now optimizing for dumb SSD's will be completely wasted. So for Linus to make the proclamation that it's completely stupid to optimize for dumb SSD's seems to be a bit premature. Market externalities (for example, does Intel have patents that will prevent competing smart SSD's from entering the market and forcing price drops?) could radically change the picture. It's not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don't know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSD's become available. Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file-system fragmentation and deeper extent allocation trees. The reason for this is that once a block is used by the file system, at least today, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released when the file was deleted. However, if 20% of the SSD's blocks have never been used, the X25-M can use 20% of the flash for better garbage collection and defragmentation algorithms. And if Intel never releases a firmware update to add ATA TRIM support, then I will be out the $400 I paid out of my own pocket for an SSD that lacks this capability, and so adding a block allocator which works around the limitations of the X25-M probably makes sense. If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization.
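To make that allocator idea concrete, here is a minimal sketch in Python of the preference order being described. It is purely illustrative: ext4's real mballoc allocator works on block groups and extents, not a pair of sets, and all names here are invented for this example.

    # Hypothetical illustration of a "reuse first" block allocation policy.
    # This is NOT ext4's mballoc; it only shows the preference order described above.

    class ReuseFirstAllocator:
        def __init__(self, total_blocks):
            self.never_used = set(range(total_blocks))  # blocks the SSD has never seen written
            self.free_used = set()                      # blocks freed after having been written

        def allocate(self):
            # Prefer blocks the SSD already considers "live"; only dip into
            # never-written blocks when there is no other choice.
            if self.free_used:
                return self.free_used.pop()
            if self.never_used:
                return self.never_used.pop()
            raise RuntimeError("file system full")

        def free(self, block):
            # Once written, a block always goes back on the "used" free list.
            self.free_used.add(block)

    alloc = ReuseFirstAllocator(total_blocks=1000)
    a = alloc.allocate()   # comes from the never-used pool the first time
    alloc.free(a)
    b = alloc.allocate()   # the freed block is reused before touching any fresh block
    assert a == b

The tradeoff is exactly the one described above: the file system fragments more quickly, in exchange for leaving a larger pool of never-written flash available for the drive's internal garbage collection.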
(Hard drive vendors have historically been S-L-O-W to finish standardizing new features and then letting such features enter the marketplace, so I'm not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM commands for about six months. What's taking the ATA committees and SSD vendors so long? <grin>) On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4. Or maybe Sandisk will make an ATA TRIM capable SSD available soon, one which is otherwise competitive with Intel's, and I get a free sample, but it turns out another optimization on Sandisk SSD's will give me an extra 10% performance gain under some workloads. Is it worth it in that case? Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether or not such a tweak will be unnecessary, or perhaps actively unhelpful, in the next generation. As long as SSD manufacturers force us to treat these devices as black boxes, there may be a certain amount of cargo cult science forced upon us file system designers; or, I guess I should say, in order to be more academically respectable, we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box. Heh. Related posts (automatically generated):
  1. Aligning filesystems to an SSD's erase block size I recently purchased a new toy, an Intel X25-M SSD,...

20 February 2009

Theodore Ts'o: Aligning filesystems to an SSD's erase block size

I recently purchased a new toy, an Intel X25-M SSD, and when I was setting it up initially, I decided I wanted to make sure the file system was aligned on an erase block boundary. This is generally considered to be a Very Good Thing to do for most SSD's available today, although there's some question about how important this really is for Intel SSD's (more on that in a moment). It turns out this is much more difficult than you might first think: most of Linux's storage stack is not set up well to worry about alignment of partitions and logical volumes. This is surprising, because it's useful for many things other than just SSD's. This kind of alignment is important if you are using any kind of hardware or software RAID, especially RAID 5, because if writes are done on stripe boundaries, it can avoid a read-modify-write overhead. In addition, the hard drive industry is planning on moving to 4096-byte sectors instead of the way-too-small 512-byte sectors at some point in the future. Linux's default partition geometry of 255 heads and 63 sectors/track means that there are 16065 (512 byte) sectors per cylinder. The initial round of 4k sector disks will emulate 512-byte disks, but if the partitions are not 4k aligned, then the disk will end up doing a read/modify/write on two internal 4k sectors for each singleton 4k file system write, and that would be unfortunate.

Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track. This results in a cylinder boundary which is divisible by 8, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned. So this is one place where Vista is ahead of Linux... unfortunately, the default 255 heads and 63 sectors is hard coded in many places in the kernel, in the SCSI stack, and in various partitioning programs, so fixing this will require changes in many places.

However, with SSD's (remember SSD's? This is a blog post about SSD's...) you need to align partitions on at least 128k boundaries for maximum efficiency. The best way to do this that I've found is to use 224 (32*7) heads and 56 (8*7) sectors/track. This results in 12544 (or 256*49) sectors/cylinder, so that each cylinder is 49*128k. You can do this by starting fdisk with the following options when first partitioning the SSD:

# fdisk -H 224 -S 56 /dev/sdb

The first partition will only be aligned on a 4k boundary, since in order to be compatible with MS-DOS, the first partition starts on track 1 instead of track 0, but I didn't worry too much about that since I tend to use the first partition for /boot, which tends not to get modified much. You can go into expert mode with fdisk and force the partition to begin on a 128k alignment, but many Linux partition tools will complain about potential compatibility problems (which are obsolete warnings, since the systems that would have had problems booting with such partitions haven't been made in about ten years); I didn't need to do that, so I didn't worry about it. So I created a 1 gigabyte /boot partition as /dev/sdb1, and allocated the rest of the SSD for use by LVM as /dev/sdb2. And that's where I ran into my next problem: LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k.
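For what it's worth, the geometry arithmetic is easy to sanity-check; the snippet below is just that arithmetic in Python, with the 128k erase block size assumed throughout this post:

    # Sanity checks for the alignment arithmetic above.
    SECTOR = 512
    ERASE_BLOCK = 128 * 1024

    # 224 heads x 56 sectors/track: cylinders are a whole number of erase blocks.
    sectors_per_cyl = 224 * 56
    print(sectors_per_cyl)                               # 12544
    print(sectors_per_cyl * SECTOR // ERASE_BLOCK,       # 49 erase blocks per cylinder
          (sectors_per_cyl * SECTOR) % ERASE_BLOCK)      # remainder 0, i.e. 128k aligned

    # The traditional 255/63 geometry is not even 4k aligned at cylinder boundaries.
    print((255 * 63 * SECTOR) % 4096)                    # 512

    # And LVM's default 192k metadata area is the problem tackled next:
    print((192 * 1024) % ERASE_BLOCK)                    # 65536, so 192k is not 128k aligned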
So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its metadata, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate --metadatasize 250k /dev/sdb2
  Physical volume "/dev/sdb2" successfully created

Why 250k and not 256k? I can't tell you; sometimes the LVM tools aren't terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

# pvs /dev/sdb2 -o+pe_start
  PV         VG   Fmt  Attr PSize  PFree  1st PE
  /dev/sdb2       lvm2      73.52G 73.52G 256.00K

If you use a metadata size of 256k, the first PE will be at 320k instead of 256k. There really ought to be a --pe-align option to pvcreate, which would be far more user-friendly, but we have to work with the tools that we have. Maybe in the next version of the LVM support tools...

Once you do this, we're almost done. The last thing to do is to create the file system. As it turns out, if you are using ext4, there is a way to tell the file system that it should try to align files so they match up with the RAID stripe width. (These techniques can be used for RAID disks as well.) If your SSD has a 128k erase block size, and you are creating the file system with the default 4k block size, you just have to specify a stripe width when you create the file system, like so:

# mke2fs -t ext4 -E stripe-width=32,resize=500G /dev/ssd/root

(The resize=500G limits the number of blocks reserved for resizing this file system, so that the guaranteed size to which the file system can be grown via online resize is 500G. The default is 1000 times the initial file system size, which is often far too big to be reasonable. Realistically, the file system I am creating is going to be used for a desktop device, and I don't foresee needing to resize it beyond 500G, so this saves about 50 megabytes or so. Not a huge deal, but "waste not, want not", as the saying goes.) With e2fsprogs 1.41.4, the journal will be 128k aligned, as will the start of the file system, and with the stripe-width specified, the ext4 allocator will try to align block writes to the stripe width where that makes sense. So this is as good as it gets without kernel changes to make the block and inode allocators more SSD aware, something which I hope to have a chance to look at.
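As an aside, if you end up doing this on more than one machine, the pe_start check can be scripted. The following is only a sketch; it assumes LVM2's standard pvs reporting options (--noheadings, --nosuffix, --units) and the 128k alignment target used in this post:

    # Sketch: check that a PV's first physical extent is aligned to the erase block size.
    # Assumes the LVM2 pvs reporting options shown; run as root.  128k is an assumption.
    import subprocess

    ERASE_BLOCK = 128 * 1024

    def pe_start_bytes(pv):
        out = subprocess.check_output(
            ["pvs", "--noheadings", "--nosuffix", "--units", "b",
             "-o", "pe_start", pv])
        return int(float(out.decode().split()[0]))

    pv = "/dev/sdb2"
    start = pe_start_bytes(pv)
    if start % ERASE_BLOCK == 0:
        print("%s: first PE at %d bytes, erase-block aligned" % (pv, start))
    else:
        print("%s: first PE at %d bytes, NOT erase-block aligned" % (pv, start))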

All of this being said, it's time to revisit the question: is all of this needed for a "smart", better-by-design next-generation SSD such as Intel's? Aligning your file system on an erase block boundary is critical on first-generation SSD's, but the Intel X25-M is supposed to have smarter algorithms that allow it to reduce the effect of write amplification. The details are a little bit vague, but presumably there is a mapping table which maps sectors (at some internal sector size; we don't know for sure whether it's 512 bytes or some larger size) to individual erase blocks. If the file system sends a series of 4k writes for file system blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99, followed by a barrier operation, a traditional SSD might do a read/modify/write on four 128k erase blocks: one covering blocks 0-31, another for blocks 32-63, and so on. However, the Intel SSD will simply write a single 128k block that indicates where the latest versions of blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, and 99 can be found. This technique tends to work very well. However, over time, the table will get terribly fragmented, and depending on whether the internal block sector size is 512 or 4k (or something in between), there could be a situation where all but one or two of the internal sectors on the disk have been mapped away to other erase blocks, leading to fragmentation of the erase blocks.

This is not just a theoretical problem; there are reports from the field that this happens relatively easily. For example, see Allyn Malventano's "Long-term performance analysis of Intel Mainstream SSDs" and Marc Prieur's report from BeHardware.com, which includes an official response from Intel regarding this phenomenon. Laurent Gilson posted on the Linux-Thinkpad mailing list that when he tried using the X25-M to store commit journals for a database, after writing 170% of the capacity of the SSD the small writes caused the write performance to go through the floor. More troubling, Allyn Malventano indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degradation is permanent, and even a series of large writes apparently does not restore the drive's function; only an ATA SECURITY ERASE command to completely reset the mapping table seems to help.

So, what can be done to prevent this? Allyn's review speculates that aligning writes to erase block boundaries can help. I'm not 100% sure this is true, but without detailed knowledge of what is going on under the covers in Intel's SSD, we won't know for sure. It certainly can't hurt, though, and there is a distinct possibility that the internal sector size is larger than 512 bytes, which means the default partitioning scheme of 255 heads/63 sectors is probably not a good idea. (Even Vista has moved to a 240/63 scheme, which gives you 8k alignment of partitions; I prefer 224/56 partitioning, since the days when BIOSes used C/H/S I/O are long gone.) The Ext3 and Ext4 file systems tend to defer metadata writes by pinning them until a transaction commit; this definitely helps, and ext4 allows you to configure an erase block boundary, which should also be helpful. Enabling laptop mode will discourage writing to the disk except in large blocks, which probably helps significantly as well.
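A toy model makes the difference between the two strategies concrete. It assumes 128k erase blocks and 4k file system blocks, and is only a cartoon of whatever the X25-M actually does internally:

    # Toy model: how many 128k erase blocks does a burst of 4k writes touch?
    import math

    ERASE_BLOCK = 128 * 1024
    FS_BLOCK = 4 * 1024
    BLOCKS_PER_ERASE = ERASE_BLOCK // FS_BLOCK      # 32 file system blocks per erase block

    writes = [10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99]

    # Rewrite-in-place: each logical block lives in a fixed erase block, so every
    # distinct erase block touched must be read, modified and rewritten.
    in_place = {b // BLOCKS_PER_ERASE for b in writes}
    print(len(in_place))                            # 4 erase blocks (0-31, 32-63, 64-95, 96-127)

    # Remapping style: the new data is packed into fresh erase blocks, and a
    # mapping table remembers where each logical block now lives.
    remapped = math.ceil(len(writes) * FS_BLOCK / ERASE_BLOCK)
    print(remapped)                                 # 1 erase block holds all fifteen 4k writes

With that picture in mind, the mitigations just listed (alignment, pinned metadata writes, laptop mode) are all about presenting the drive with fewer, larger, better-placed writes.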
And avoiding fsync() in applications will also be helpful, since a cache flush operation will force the SSD to write out an erase block even if it isn't completely filled. Beyond that, clearly some experimentation will be needed. My current thinking is to use a standard file system aging workload, and then perform an I/O benchmark to see if there has been any performance degradation. I can then vary various file system tuning parameters and algorithms, and confirm whether or not a heavy fsync workload makes the performance worse. In the long term, hopefully Intel will release a firmware update which adds support for ATA TRIM/DISCARD commands, which will allow the file system to inform the SSD that various blocks have been deleted and no longer need to be preserved by the SSD. I suspect this will be a big help; if the SSD knows that certain sectors no longer need to be preserved, it can avoid copying them when trying to defragment the SSD. Given how expensive the X25-M SSD's are, I hope that there will be a firmware update to support this, and that Intel won't leave its early adopters high and dry by only offering this functionality in newer models of the SSD. If they were to do that, it would leave many of these early adopters, especially your humble writer (who paid for his SSD out of his own pocket), quite grumpy indeed. Hopefully, though, it won't come to that.

Update: I've since penned a follow-up post, "Should Filesystems Be Optimized for SSD's?" Related posts (automatically generated):

  1. Should Filesystems Be Optimized for SSD's? In one of the comments to my last blog entry,...
  2. Fast ext4 fsck times This wasn't one of the things we were explicitly engineering...

4 February 2009

Gunnar Wolf: How active are your local Linux, Free Software User Groups?

Ted Ts'o wonders about LUGs around the world, seeking to follow up on a conversation he recently had at the Linux Foundation. He quotes a blog posting at Lenovo, "Local User Groups - gone the way of the dinosaur?". I think this is an interesting point on which to gather input from others. In Mexico City, we did have a strong LUG several years ago, holding not-very-regular-but-good-quality meetings, roughly monthly, at Instituto de Ciencias Nucleares. I was active there ~1996-2001.
By 2001, however, the group stopped acting as one. Maybe one of the main factors is that we had a very strong, unquestionable group leader and cohesion factor (Miguel), who worked at Nucleares and regularly got us said auditorium. Once Miguel left to form Ximian, the group slowly drifted apart. In one of the last LUG meetings, we started working towards the National Free Software Conference (CONSOL)... Nowadays, in Mexico (as a country) we have several conferences throughout the year, although I'd be hard-pressed to say whether any of them really fills the needs of a LUG (and my answer would probably be negative). Now, there are several smaller groups that have popped up in the void left by the Mexico City LUG - mainly LUGs local to universities or faculties... And yes, a 25-million-people city is too large to have a single, functional LUG - just the geographical size of the thing is too daunting. Besides, we are too many people, even though few of us are contributing any real work. But I also recognize that a local *users* group should care about making the users better, before focusing on making the world a better place ;-) Anyway... My intention with this post, besides writing down what I see, is to ask other people who read me (I know this blog is syndicated at Planeta Linux Mexico, and maybe even read in other Latin American countries through Planeta EDUSOL) to write about what they see in their local communities. To make this a bit more useful, please leave a comment (in English, if possible) at this blog, so this can be used as a summary for Ted's request as well.

25 January 2009

Junichi Uekawa: Theo's post on hardlink backup tools made me think.

Theo's post on hardlink backup tools made me think. In his post, "Wanted: Incremental Backup Solutions that Use a Database", he proposed building an incremental backup solution on a database (RDBMS). He's right that hardlinking files so much puts a load on the filesystem, especially when fscking. One thing to note is that although conceptually an RDBMS handles relations, physically (and practically) it is doing similar things to a hardlinked filesystem; the two are just optimized differently, and the filesystem would need to be tuned for its 'fsck' behavior with many hardlinks. So there could be a filesystem which is tuned for hardlinking. Of course, just because I thought about it doesn't mean I have come up with a good solution to the problem.

12 January 2009

Theodore Ts'o: Wanted: Incremental Backup Solutions that Use a Database

Dear Lazyweb, I'm looking for recommendations for Open Source backup solutions which track incremental backups using a database, and which do not use hard link directories. Someone gave me a suggested OSS backup program at UDS, but it's slipped my memory; so I'm fairly sure that at least one or more such OSS backup solutions exist, but I don't know their names. Can some folks give me some suggestions? Thanks!

There are a number of very popular Open Source backup solutions that use a very clever hack: using hard link trees to maintain incremental backups. The advantage of such schemes is that they are very easy to code up, and they allow you to easily browse incremental backups by simply using cd and ls. The disadvantage of such schemes is that they create a very large number of directory blocks which must be validated by an fsck operation. As I've discussed previously, this causes e2fsck to consume a vast amount of memory; sometimes more than can be supported by 32-bit systems. Another problem, which has recently been brought home to me, is how much time it can take to fsck such file systems. This shouldn't have come as a surprise. Replicating the directory hierarchy for each incremental backup is perhaps the most inefficient way you could come up with for storing information for an incremental backup. The filenames are replicated in each directory hierarchy, and even if an entire subtree hasn't changed, the directories associated with that subtree must be replicated for each snapshot. As a result, each incremental snapshot results in a large number of additional directory blocks which must be tracked by the filesystem and checked by fsck. For example, in one very extreme case, a user reported to me that their backup filesystem contained 88 million inodes, of which 77 million were directories. Even if we assume that every directory was only a single block long, that still means that during e2fsck's pass 2 processing, 77 million times 4k, or roughly 308 gigabytes' worth of directory blocks, must be read into memory by e2fsck. Worse yet, these directory blocks are scattered all across the filesystem, which means the time simply to read all of the directory blocks so they can be validated will be very, very long indeed.

The right way to implement tracking of incremental backups is to use a database, since it can much more efficiently store and organize the information about which file is located where for each incremental snapshot. If the user wants to browse an incremental snapshot via cd and ls, this could be done via a synthetic FUSE filesystem. There's a reason why all of the industrial-strength, enterprise-class backup systems use real databases; it's the same reason why enterprise-class databases use their own data files, and don't try to store relational tables in a filesystem, even if the filesystem supports b-tree directories and efficient storage of small files. Purpose-written-and-optimized solutions can be far more efficient than general-purpose tools. So, can anyone recommend OSS backup solutions which track incremental backups using some kind of database? It doesn't really matter whether it's MySQL, PostgreSQL, or SQLite, so long as it's not using/abusing a file system's hard links to create directory trees for each snapshot. For bonus points, there would be a way to browse a particular incremental snapshot via a FUSE interface. What suggestions can you give me, so I can pass them on to ext3/4 users who are running into problems with backup solutions such as Amanda, BackupPC, dirvish, etc.?
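(For the curious, the arithmetic behind that pass 2 estimate is simple; the snippet below just restates it, assuming the best case of one 4k block per directory.)

    # Rough lower bound on the directory data e2fsck pass 2 has to read,
    # using the numbers from the report above.
    directories = 77 * 10**6     # 77 million directory inodes
    block_size = 4096            # assume each directory is a single 4k block (best case)

    total = directories * block_size
    print(total / 10**9)         # ~315 decimal gigabytes
    print(total / 2**30)         # ~294 GiB; either way, on the order of the ~300 GB
                                 # of scattered directory blocks cited above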
Update: One thing which a number of these hard-link tools do right is de-duplication; that is, they will create hard links between multiple files that have the same contents, even if they are located in different directories with different names (or have been renamed or had their directory structure reorganized since they were originally backed up), and even if the two files are from different clients. This basically means the database needs to include a checksum of each file, and a way to look up whether the contents of that file have already been backed up, perhaps under a different name. Unfortunately, it looks like many of the tape-based tools, such as Bacula, assume that tape is cheap, so they don't have these sorts of de-duplication features. Does anyone know of non-hard-link backup tools that do de-dup and which are also Open Source?
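None of the above dictates a particular schema, but to make the database approach concrete, here is a minimal sketch of the kind of catalog such a tool could keep, using SQLite. The table and column names are invented for illustration and are not taken from any existing backup program: contents are stored once per checksum, and each snapshot just maps paths to checksums, which gives de-duplication across names, directories and clients for free.

    # Minimal sketch of a database-backed incremental backup catalog with de-duplication.
    # Table and column names are illustrative only; real tools differ.
    import hashlib, sqlite3

    db = sqlite3.connect("catalog.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS blobs (
        checksum TEXT PRIMARY KEY,      -- e.g. SHA-256 of file contents
        size     INTEGER NOT NULL,
        store    TEXT NOT NULL          -- where the single stored copy lives
    );
    CREATE TABLE IF NOT EXISTS snapshots (
        snap_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        client   TEXT NOT NULL,
        taken_at TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS entries (
        snap_id  INTEGER REFERENCES snapshots(snap_id),
        path     TEXT NOT NULL,
        checksum TEXT REFERENCES blobs(checksum),
        PRIMARY KEY (snap_id, path)
    );
    """)

    def backup_file(snap_id, path, data):
        csum = hashlib.sha256(data).hexdigest()
        # De-duplication: store the contents only if this checksum is new.
        if db.execute("SELECT 1 FROM blobs WHERE checksum=?", (csum,)).fetchone() is None:
            db.execute("INSERT INTO blobs VALUES (?, ?, ?)", (csum, len(data), "pool/" + csum))
            # ... write `data` to the storage pool here ...
        db.execute("INSERT INTO entries VALUES (?, ?, ?)", (snap_id, path, csum))

    db.execute("INSERT INTO snapshots (client, taken_at) VALUES ('host1', '2009-01-12')")
    snap = db.execute("SELECT last_insert_rowid()").fetchone()[0]
    backup_file(snap, "/etc/motd", b"hello")
    backup_file(snap, "/etc/motd.copy", b"hello")   # same contents, new name: no new blob
    db.commit()
    print(db.execute("SELECT COUNT(*) FROM blobs").fetchone()[0])   # 1

A FUSE layer for browsing a snapshot would then only need to read the entries table for that snapshot and serve file contents out of the storage pool.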

30 November 2008

Theodore Ts'o: Followups to the ebooks ethical question

When I have a moment, I’ll try to tally up the responses that I got to “An ethical question involving ebooks” and see if there are any interesting patterns based on self-identified generational markers. Obviously, this is not a properly controlled survey, so the results aren’t going to mean much, but it is interesting that some fairly passionately written comments came from folks who self-identified as coming from generations that broke with the common stereotypes of their respective demographic groups. If I were going to commission a study, one thing that I would almost certainly do is to also pose a similar question about music and mp3’s, and have half the surveys ask the question about ebooks first, and half ask the question about music first. It would be interesting to see if (a) there is a difference in attitudes between music and books, and (b) whether the order of the questions might influence the answers or not.

A number of people have asked me about the author’s name and the title of the books/series involved. I deliberately didn’t include that information, for a number of reasons. First of all, I don’t believe identifying the author/books/character involved is relevant to the question at hand, and in fact, it might be distracting. Secondly, given the many comments, some of them quite passionate, I don’t think it would be fair to drag her name into the discussion without her permission first. I will say that the author does have a fairly extensive internet presence, and has apparently gotten a lot of questions about said character, and in fact about whether those books would be made into ebooks. It’s been made quite clear that while those books were successful, they weren’t that successful, and so from an economic point of view, she chooses to write books that she (and her publishers) feel will be more economically viable. Because there will likely be no further books published containing this character, it is very unlikely that the publisher will reprint the original series of books — and when asked about whether they would be made available in ebook form, her response was effectively “it’s up to the publisher”. Apparently she has worked with a number of publishers, and while publisher X hasn’t been willing to publish her books in ebook form, publisher Y has. Furthermore, it seems that her contracts apparently delegate all decisions about how her books will be published, and whether a large Major Big City Law Firm with Fangs (aka MBCLFF) will go after copyright infringers, to her publishers and her agent (who is a lawyer at said MBCLFF, and who could presumably inflict major Hurt on copyright infringers that curry the lawyer’s disfavor). I don’t know if this is true, or just her way of managing her relationship with her fans by disclaiming all responsibility for publication forms and enforcement decisions to others — but some authors do make such choices, if they are much more interested in the writing and storytelling end of things than the business side of things.

Which brings up an interesting question with respect to copyright enforcement. It’s pretty obvious that many people will give different answers to the question of how much deference should be given to copyrights depending on whether they are owned by The Struggling Author versus The Big Media Corporate Monolith, with many more allowances given if the question is framed as being primarily about the former rather than the latter.
Another way in which the framing of the question radically changes the outcome is whether the focus is on making sure the author (and/or his surviving widow/widower/children) gets paid, or whether the focus is on control of one’s works. If you believe the primary justification is an economic one, then that leads to a series of ethical conclusions — the most obvious of which is that if it doesn’t result in a direct (or perhaps indirect) monetary loss to the author, there should not be a moral or ethical problem. There might be some question as to whether devaluing the secondary market might discourage the sale of new books, and hence indirectly harm the author sufficiently that this should be a concern, but those issues can be worked out.

If, however, you believe the primary issue at hand is one of control, a very different set of issues has to be factored into the conversation. For example, what if the author is ashamed of a book or series, and wants it to go quietly out of print and hopefully disappear? How should that be weighed against fans who disagree with the author and who love the series? What is the right balance? For those who argue that the author’s wishes should be sacrosanct — should we move things more in that direction? What if all texts lived in DRM’ed, encrypted containers, and electronic readers had to ask permission of a central authorization server before the text could be displayed? This would allow the author, after the fact, to prevent anyone from reading his or her works, if for some reason the author so desired. Would that be a good thing? If not — and I hope most authors would agree this would be a horrific power to give copyright holders — then it’s clear that authors’ moral rights as creators should not be entirely sacrosanct, and that society also has some claim on preserving its culture: once a book has been published and becomes part of the culture, society should have some claim on that book. Whether copyright terms should be 14 years or 20 years, as opposed to whenever the Disney corporation feels like paying off more legislators to extend copyright terms, is one way that question could be asked. Another is whether society should have the right to say that after some number of years in which a work has been abandoned for commercial exploitation, it should automatically enter the public domain. There are no obvious answers here.

The final point that I want to make, which may be fairly controversial amongst the Open Source programmers in the room, is that if you believe that copyright should fundamentally be about economic arguments of “no harm, no foul”, then this is in direct contradiction with the belief that lawsuits should be used in order to enforce the GPL. After all, the conditions imposed by the GPL are fundamentally about control, not about economic issues. Consider — if someone uses the Busybox project in an embedded device — especially if no changes have been made to the code — who has been harmed, economically? No harm, no foul, right? Or if someone uses GPLv3 code in a firmware which is protected by a digital signature — sure, it means that end users who want to modify the firmware and then use it to enhance/extend the device won’t be able to do so. But how does that economically harm the author of the GPLv3 code? Fundamentally, Copyleft schemes are all about extending control over how the code can be used.
Hence, if you are a Free Software programmer who cheers on the activities of the SFLC, or who firmly believes that no one should be allowed to mix firmware which is not shipped with source with your GPL’ed software, it is completely and profoundly hypocritical to say, “F*ck the author’s wishes; if it’s not available in the form I want, I should be able to make a derived work to transfer the work into a form that I want.” What if the author is a luddite who hates eBooks and firmly believes, and wants to enforce, that their works should never be made available in eBook form? How is that fundamentally different from a Free Software Acolyte saying that, because they abhor non-free firmware, they don’t want to allow their code to be shipped alongside binary-only firmware?

28 November 2008

Theodore Ts'o: An ethical question involving ebooks

I recently purchased a short story from Fictionwise, which was not DRM’ed, so I could easily get it into a form where I could read it on my Sony eReader. Thanks to that short story, I was introduced to an author, and a character, whom I found very engaging. When I decided to find out more about the character, I found that the author had written two additional short stories, and three additional novels many years ago, but has since stopped writing any more books involving that character. Furthermore, the novels have gone out of print, and are only available from amazon.com as used books. Unfortunately, I travel a lot. So much so, that one of the few times that I have time to read is when I’m traveling. And I really dislike having to haul dead-tree versions of my favorite novels around; they take up far too much weight and space in my carry-on luggage. Unfortunately, these out-of-print novels were published by a Neanderthal Publishing company which hasn’t made any of the books available in ebook format, DRM’ed or no. Grumpy, I searched the Internet, and found all three novels were easily available for free download — in a pirated form, of course. Should I download them and convert them into a form which would allow me to read them on my Sony eReader?

Well, according to Russell Davis, former chair (and now president of the Science Fiction Writer’s Association) of the SFWA’s Copyright Committee, “electronic infringement is theft”. From a legal perspective, I suppose that is true. And given that as an Open Source programmer, I depend on Copyright Law to assure that my wishes as an author are upheld, it would be hypocritical for me to assume that I should be able to ignore Copyright Law just because it is inconvenient. And yet… from a moral perspective, who has really lost anything? The argument made by Russell Davis is that infringement is bad because it is “harming authors and author estates”. Jerry Pournelle has indignantly proclaimed that e-piracy goes against a “specific (and very stern) Biblical injunction against stealing from widows and orphans”. Of course, in this case, the author is still alive (and is female, although I suppose stealing from widowers would be just as bad). Also, given that the author has publicly stated she doesn’t plan to write any more books involving this character (since some of her more psychotic readers sent her death threats as a result of reading said books), the publisher is highly unlikely to re-release said novels — and if I buy used dead-tree versions of said novels, the author doesn’t receive any additional royalties.

So, then, where is the moral bright line? Somewhere along this continuum, we’ve crossed over from the light side to the dark side. Setting aside the observation that the Neanderthal attitudes and business practices of the publisher involved have made it impossible for me to legitimately follow the law, enjoy the novels, and direct money to the author via royalty payments — what do you think is the morally correct course of action? And why? And if you don’t mind saying so publicly, roughly what generation are you from (i.e., Baby Boomer, Gen X, Gen Y, etc.)? I’m curious how attitudes are changing based on age, and whether folks who are currently in college might differ from those who can remember a time when the Internet didn’t exist… Update: I’ve posted a follow-up to this post here.
